Soeren Sonnenburg: Shogun at Google Summer of Code 2012
The summer has finally come to an end (yes, in Berlin we still had 20 °C at the
end of October) and, unfortunately, so has GSoC. This was the second time
SHOGUN took part in GSoC. For those unfamiliar with SHOGUN: it is a very
versatile machine learning toolbox that enables unified large-scale learning
for a broad range of feature types and learning settings, like classification,
regression, or explorative data analysis. I again played the role of an org
admin and co-mentor this year and would like to take the opportunity to
summarize enhancements to the toolbox and my GSoC experience. In contrast to
last year, we required code contributions already in the application phase of
GSoC, i.e., a (small) patch was mandatory for an application to be considered.
This reduced the number of applications we received (48 proposals from 38
students instead of 70 proposals from about 60 students last year) but also
increased the overall quality of the applications.
In the end we were very happy to get 8
very talented students and to have the opportunity to boost the project thanks
to their hard and awesome work. Thanks to Google for sponsoring three more
students compared to last GSoC. Still, we gave one slot back to the pool for
the good of the Octave project (they used it very wisely and Octave will have a
just-in-time compiler now, which will benefit us all!).
SHOGUN 2.0.0 is the new release of the toolbox including of course all the new
features that the students have implemented in their projects. On the one hand,
modules that were already in SHOGUN have been extended or improved. For example,
Jacob Walker has implemented Gaussian Processes (GPs), improving the usability
of SHOGUN for regression problems. Chiyuan Zhang has built a framework for
multiclass learning that includes state-of-the-art methods in this area such as
Error-Correcting Output Coding (ECOC) and ShareBoost, among others. In addition,
Evgeniy Andreev has made very important improvements to the accessibility of
SHOGUN. Thanks to his work with SWIG director classes, it is now possible to use
Python for prototyping and make use of that code with the same flexibility as if
it had been written in the C++ core of the project. On the other hand,
completely new frameworks and other functionalities have been added to the
project as well. This is the case for the multitask learning and domain
adaptation algorithms written by Sergey Lisitsyn and the kernel two-sample and
dependence tests by Heiko Strathmann. Viktor Gal has introduced latent SVMs to
SHOGUN and, finally, two students have worked on the new structured output
learning framework. Fernando Iglesias designed this framework, introducing
structured output machines into SHOGUN, while Michal Uricar has implemented
several bundle methods to solve the optimization problem of the structured
output SVM.
It has been fun and interesting to see how the work done in different projects
was put together very early, even during the GSoC period. One example combines
the generic structured output framework with the accessibility improvements: it
is possible to use SWIG directors to implement the application-specific
mechanisms of a structured learning problem instance in Python and then use the
rest of the framework (written in C++) to solve this new problem.
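To give a feeling for this division of labour, here is a minimal pure-Python
sketch of the pattern. It does not use SHOGUN or SWIG at all: the class and
function names (StructuredModel, train_structured_perceptron, MulticlassModel)
are made up for illustration, and a plain structured perceptron stands in for
the real solvers. The point is only that the generic solver talks to the
problem through two overridable hooks, which is the part one would write in
Python.

```python
# Illustrative only: hypothetical names, not SHOGUN's actual API.

class StructuredModel:
    """Stand-in for the framework-side base class (C++ in SHOGUN)."""

    def joint_feature(self, x, y):
        """Return the joint feature vector Psi(x, y) as a list of floats."""
        raise NotImplementedError

    def argmax(self, w, x):
        """Return the highest-scoring output y for input x under weights w."""
        raise NotImplementedError


def train_structured_perceptron(model, data, dim, epochs=10):
    """Generic solver: it only reaches the problem through the two hooks."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y_true in data:
            y_pred = model.argmax(w, x)
            if y_pred != y_true:
                phi_true = model.joint_feature(x, y_true)
                phi_pred = model.joint_feature(x, y_pred)
                # Move w towards the true output, away from the wrong one.
                w = [wi + a - b for wi, a, b in zip(w, phi_true, phi_pred)]
    return w


class MulticlassModel(StructuredModel):
    """Application-specific part: multiclass as a tiny structured problem."""

    def __init__(self, num_classes, num_features):
        self.num_classes = num_classes
        self.num_features = num_features

    def joint_feature(self, x, y):
        # Copy x into the block of the vector that belongs to class y.
        phi = [0.0] * (self.num_classes * self.num_features)
        start = y * self.num_features
        phi[start:start + self.num_features] = x
        return phi

    def argmax(self, w, x):
        def score(y):
            phi = self.joint_feature(x, y)
            return sum(wi * pi for wi, pi in zip(w, phi))
        return max(range(self.num_classes), key=score)


# Toy data: (feature vector, class label).
data = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.2, 0.8], 1)]
model = MulticlassModel(num_classes=2, num_features=2)
w = train_structured_perceptron(model, data, dim=4)
predictions = [model.argmax(w, x) for x, _ in data]
```

With director classes the base class would live in the C++ core, and the
overridden methods would be called back from there, so the solver itself never
needs to know that the problem-specific code is Python.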
Students! You all did a great job and I am more than amazed by what you all
have achieved. Thank you very much and I hope some of you will stick around.
Besides all these improvements it has been particularly challenging for me as
org admin to scale the project. While I could still be deeply involved in each
and every part of the project last GSoC, this was no longer possible this year.
Learning to trust that your mentors are doing the job is something that didn't
come easy to me. Holding all-hands meetings about once a month did help, and so
did monitoring the happiness of the students. I am glad that it all worked out
nicely this year too. Again, I would like to mention that SHOGUN improved a lot
in terms of code base and code quality. Students gave very constructive feedback
about our lack of proper Vector/Matrix/String/Sparse Matrix types. We now have
all of these implemented, doing automagic memory garbage collection behind the
scenes. We have started the transition to Eigen3 as our matrix library of
choice, which made quite a number of algorithms much easier to implement. We
generalized the Label framework (CLabels) to handle not just classification and
regression but also multitask and structured output learning.
Finally, we have had quite a number of infrastructure improvements. Thanks to
GSoC money we have a dedicated server for running the buildbot/buildslaves and
website. The ML Group at TU Berlin sponsors virtual machines for building
SHOGUN on Debian and Cygwin. Viktor Gal stepped up to provide buildslaves for
Ubuntu and FreeBSD. Gunnar Raetsch's group is supporting Red Hat based build
tests. Travis CI now tests pull requests for breakage even before they are
merged. Code quality is now monitored using LLVM's scan-build. Bernard
Hernandez appeared and wrote a fancy new website for SHOGUN.
A more detailed description of the achievements of each of the students follows:
- Kernel Two-sample/Dependence test
  - Student: Heiko Strathmann
  - Mentors: Arthur Gretton, Soeren Sonnenburg
- Implement multitask and domain adaptation algorithms
  - Student: Sergey Lisitsyn
  - Mentor: Christian Widmer
- Implementation of latent SVMs / integration via existing GPL code
  - Student: Viktor Gal
  - Mentor: Alexander Binder
- Bundle method solver for structured output learning
  - Student: Michal Uricar
  - Mentor: Vojtech Franc
- Build generic structured output learning framework
  - Student: Fernando Jose Iglesias Garcia
  - Mentor: Nico Goernitz
- Improving accessibility to shogun
  - Student: Evgeniy Andreev
  - Mentor: Soeren Sonnenburg
- Implement Gaussian Processes and regression techniques
  - Student: Jacob Walker
  - Mentor: Oliver Stegle
- Build generic multiclass learning framework
  - Student: Chiyuan Zhang
  - Mentor: Cheng Soon Ong